NAT2 global landscape: Genetic diversity and acetylation statuses from a systematic review

Arylamine N-acetyltransferase 2 has been related to drug side effects and cancer susceptibility; its protein structure and acetylation capacity result from the arrays of polymorphisms in the NAT2 gene. Absorption, distribution, metabolism, and excretion, the cornerstones of pharmacological effects, show diverse patterns across populations and ethnic groups, including interethnic variation. Although the 1000 Genomes Project database has portrayed the global diversity of the NAT2 polymorphisms, several populations and ethnicities remain underrepresented, limiting the comprehensive picture of its variation. The clinical implications of NAT2 require a detailed landscape of its striking diversity. This systematic review covers the genetic and acetylation patterns reported in 164 articles published from October 1992 to October 2020. Descriptive studies and controls from observational studies expanded the NAT2 diversity landscape. Our study included 243 different populations and 101 ethnic minorities, and, for the first time, we present the global patterns in Middle Eastern populations. Europeans, including their derived populations, and East Asians have been the most studied genetic backgrounds. Contrary to popular perception, Africans, Latinos, and Native Americans have been significantly represented in recent years. NAT2*4, *5B, and *6A were the most frequent haplotypes globally. Nonetheless, *5B was less frequent and *7B more frequent in Asians. Regarding acetylator status, East Asians and Native Americans harboured the highest frequencies of the fast phenotype, followed by South Europeans. Central Asian, Middle Eastern, and West European populations were the major carriers of the slow acetylator status. The detailed panorama presented herein expands the knowledge of diversity patterns at the genetic and acetylation levels. These data could help clarify the controversial findings between acetylator states and the susceptibility to diseases and reinforce the utility of NAT2 in precision medicine.

5. "Do the indels have any role in the phenotypic presentation of NAT2? There is no mention of indels".
Thank you for this observation. We have included this information, with the respective citations, in the Introduction section. We agree with your comment, and, given the observations made by Reviewer 2, we have removed the statements in lines 154 to 157, because haplotype-phenotype reconstruction could lack accuracy and detract from the actual outcomes.

"
The studies carried out on Asian living in western countries pose another problem. They are increasingly inbred even more than their related populations in their native countries. Sometime these studies do not reflect the true picture of larger population. Did the authors scrutinize this fact? https://www.science.org/doi/10.1126/science.aac8624. The authors mentioned that 70% percent of Asian studies belongs to China and Japan and these populations have a different genetic structure than south Asian regions. This will apparently limit the wider applicability of results in Asian population".
We thank the reviewer for this remark, which raises a point we had not considered. We have addressed your recommendations in the Discussion section, drawing on the suggested article and others related to this topic.

We apologize for this misunderstanding, and many thanks for this suggestion. Firstly, the number of haplotypes you pointed out (i.e., 8,551 and 7,874) depends on the sample size. The misunderstanding was our mistake. To be more accurate, we have clarified these points in the description of the Results in all sections. Indeed, the Americas presented 31 different haplotypes, whereas Asia presented only 16. Regarding the differences in diversity patterns (Asia versus the Americas, as well as the Americas versus Europe), we have included the answers to your questions in the Discussion section.
In this manuscript, the authors report on a systematic review of all published data on 7 polymorphisms in the coding exon, plus a regulatory variant, of the gene NAT2 in world-wide human populations. The authors screened more than 1000 publications so as to retrieve, after seemingly careful quality control, frequencies of NAT2 alleles (i.e. SNPs) and/or genotypes (at these SNPs), and/or haplotypes and/or phenotypes in 321 diverse human populations/groups. They then used this dataset to document the haplotypic and phenotypic diversity of NAT2 in humans. I particularly appreciate the initiative and effort as we undertook a similar process some 15 years ago that gave rise to a publication that the authors cite (Sabbagh et al. 2011). Hence, I believe that the study is valuable and I understand the amount of work devoted to it, but unfortunately the manuscript is not, in my view, ready for publication. Actually, I believe that it still needs substantial additional thinking and work. The main reason for my reluctance to accept it in its present state is that, besides that the manuscript includes several vague assertions, the study seems to suffer from some methodological problems and even what I believe to be mistakes.
You are right. As you mentioned, this work was demanding, and we mistakenly included data obtained from fewer SNPs, which contravened our inclusion criteria. These data have been removed from the maps but are retained in Table S1. Likewise, we have clarified that only acetylator statuses obtained from at least six SNPs were used for the analyses. These points have been corrected and clarified in the Materials and Methods section, and the changes are reflected in the Results section.

"Another example of this problem of resolution level is provided in the Mat & Met section,
where the authors state (I quote) "In those cases where the authors did not define the specific haplotype (i.e., NAT2*5A) and only reported the general haplotype (i.e., NAT2*5), the most frequent one was taken by default." What does this mean ? If authors reported that they found haplotype *5 in their publication, and not *5A, it is probably because they hadn't genotyped the marker that defines the *5A series (according to Supplementary

of resolution is needed and should be well explained in the manuscript (or if needed in a Supplementary text file). I would advise the authors to start by deciding of a set of SNPs that have to be included for a sample to be considered, then infer the haplotype resolution possible for that set of SNPs, and infer the genotypes and phenotypes on that basis. If the authors still wish to IMPUTE some data (as they seem to have done with the *5/*5A example), then this should be clearly stated in Table S1 as well".
We agree, and we thank you for this observation. We have modified these sentences, and the SNPs considered for acetylator statuses, haplotypes, and network analyses are now defined in the Materials and Methods section. We hope these modifications address your concern. We also agree with these suggestions and have clarified the corresponding paragraphs in the Materials and Methods section.

"Linked
6. "Another major concern I have is related to the networks. The authors produced 7 networks of haplotypes, 1 global and 6 by geographic region (or sub-region/sub-groups, figures 3 to 7).

I might be wrong here, but I believe that branch lengths in networks should be proportional to the number of mutational steps separating nodes (haplotypes) ? In those figures, this is not the case. For instance, in Figure 3, I see a branch linking *13A and *6F, with 3 mutational steps on this very short branch, whereas the very long branch linking *6A and *6J has only one mutational step indicated. Such a network is distorted, I believe, and does not properly represent what it should : the molecular diversity of haplotypes and the most parsimonious evolutionary steps linking those haplotypes. At present, these networks are unreadable, with many difficult to understand reticulations. Moreover, they are presented in such a way that one cannot directly compared them between the diverse world regions considered. For instance, Figure 3 has *5B on the right and *6A and the upper-left, whereas Figure 4 has *5B on the lower-left and *6A on the upper-right. I recommend some standardization here too. But most importantly, what are these networks used for ? At present, I only read some statements on the frequency of singletons… but without any mention to sample size. Identification of singletons is more probable as sample size increases. For instance, the authors state that (I quote) "Of note are the singletons and the regional haplotypes in Latinos and Native Americans (Figs 5B and 5C).", and "The admixed populations from Brazil and Mexico exhibited many singletons and local haplotypes (*6J, *7C, and *7E) (Fig 5B)". Indeed, in Figure 5B for instance, most singletons are from Brazil (and most of the haplotypic diversity indeed), not from Nicaragua and Peru. But look at the sample sizes in
Thank you so much for this comment. We have used the continent subdivision proposed by the World Atlas webpage, as we considered that this subdivision would be better suited to our comparisons. This information appears in the Materials and Methods section.
Many thanks for these suggestions. We have been more careful with the use of adjectives.
(Figs 9 -15)." How much higher ? Such assertions (of which there are many instances of throughout the manuscript) should be grounded on numbers (and if possible, statistical tests of the differences)".
We acknowledge your comment. We have included a table (Table 1) with these data as well as the statistical analyses.
11. "Related to this last point, I would also like to advise the authors to maybe read again our 2011 publication, in which we showed, as the authors stated in their abstract that (I quote) "NAT2*4, *5B, and *6A were the most frequent haplotypes globally. Nonetheless, the distribution of *5B and *7B were less and more frequent in Asians, respectively." This is exactly what we had shown more than 10 years ago (see Sabbagh et al. 2011, Figure 2
We agree with this comment. In order to follow your suggestion, we have modified it. Thank you so much for giving us free access to this chapter.
12. "Then, except for Figure 11, all those figures showing the distribution of single SNPS are irrelevant, because they are not mutually independent. The information on single SNPs is included in the information on haplotypes. If Cambodians have a high frequency of rs1799931-A, it is due to the fact that they have a high frequency of haplotypes *7 (1 x *7A and 5 x *7B), but note the sample size: 7 individuals".
Thank you so much for this suggestion. Nonetheless, we consider that the information provided by a figure is valuable and gives a general idea of the frequency distribution. We have modified the presentation of the frequencies using area plots, and we have included this information in the Supplemental Information section.
Table S1 represent ? Or correct "Missence" to "Missense" in Table S1. Or what is the "ancestral burden" (abstract) ? Or which are "the controversial findings between acetylator states and the susceptibility to diseases" (abstract) ? But this is altogether already a very lengthy review".
Thank you so much for this observation. The sentence was reformulated.
I hope that all my remarks are constructive and will help the authors to substantially improve their manuscript. Again, I believe that the study is valuable, but it has to be consequently revised to get ready for publication.
We agree with your suggestions. We have addressed them, and we are sure that your expertise has improved the quality and content of our manuscript. Thank you so much!